Machine Learning on Amazon Retail Data
  • Code
  • By Bhavana
  1. Data Prep / EDA
  • Home
  • Introduction
  • Data Prep / EDA
  • Models and Methods
    • ARM (Association Rule Mining)
    • Naive Bayes
    • Clustering
    • Decision Trees
    • Neural Networks
    • Regression
    • SVM (Support Vector Machine)
  • Conclusions

On this page

  • Data Collection
  • Data Cleaning
  • Data Preprocessing / Visualization

Data Prep / EDA

Where the data source, processing, and visualization (EDA) is presented.

Data Collection

Amazon product information was scraped from the website using the API service ScraperAPI; this is because, as Amazon is a hugely popular website, they have many anti-scraping measures in place such as rate-limiting, IP blocking, dymamic loading, and such. Using the external API service, these limitations were able to be avoided. The search queries chosen to search for items were based on top 100 Amazon searches, found on this site and this site. An example of using the API, along with its core endpoint, is below.

import requests

payload = {
   'api_key': 'API_KEY',
   'query': 'iphone 15 charger',
   's': 'price-asc-rank'
}

response = requests.get('https://api.scraperapi.com/structured/amazon/search',
                        params=payload).json()

The jupyter notebook code for the web scraping can be found here.

Additionally, more data was used to supplement the existing data. Since the scraped data was only about 26K rows, a Kaggle dataset was used that contains more than one million rows, had around the same fields as the scraped data, and was also from the USA (many Amazon Kaggle datasets were from the non-US).

The raw data from both sources can be seen below in Table 1; the scraped raw data CSV can also be viewed here.

Table 1: The raw data from both datasets.
(a) The raw data scraped from Amazon using ScraperAPI
type position asin name image has_prime is_best_seller is_amazon_choice is_limited_deal stars total_reviews url availability_quantity spec price_string price_symbol price original_price section_name
0 search_product 7 B0BTBN4VP1 Evaporative Portable Air Conditioners, 700ML P... https://m.media-amazon.com/images/I/71AHsuMgEZ... False False False False 2.6 112.0 https://www.amazon.com/Evaporative-Portable-Co... NaN {} NaN NaN NaN NaN NaN
1 search_product 43 B09YSZ26TY Charmin Ultra Strong Toilet Paper 12 Mega Roll... https://m.media-amazon.com/images/I/8189tH7KLX... False False False False 4.8 372.0 https://www.amazon.com/Charmin-Ultra-Strong-To... NaN {} NaN NaN NaN NaN NaN
2 search_product 35 B09Y8NK2F3 BirkenstockArizona Waxy Narrow Leather Sandals https://m.media-amazon.com/images/I/71eaJXKHjJ... False False False False 5.0 3.0 https://www.amazon.com/Birkenstock-Arizona-Nar... NaN {} $276.95 $ 276.95 NaN NaN
3 search_product 40 B0CV9H8J4C PRETYZOOM 3pcs Shoe Storage Rack Foldable Stor... https://m.media-amazon.com/images/I/51yydWjKNs... False False False False NaN NaN https://www.amazon.com/dp/B0CV9H8J4C/ref=sr_1_... NaN {} $15.78 $ 15.78 NaN NaN
4 search_product 6 B09N6PLBDX 2pcs Car Interior Center Console Decoration Fr... https://m.media-amazon.com/images/I/61iZE2SlFi... False False False False NaN NaN https://www.amazon.com/TINKI-Interior-Decorati... NaN {} $630.00 $ 630.00 NaN NaN
(b) The raw data gotten from Kaggle
asin title imgUrl productURL stars reviews price listPrice category_id isBestSeller boughtInLastMonth
0 B0B63CFNPY JINTUM 414A Toner Cartridge (with CHIP) Compat... https://m.media-amazon.com/images/I/71CGA50Aes... https://www.amazon.com/dp/B0B63CFNPY 4.2 572 139.99 149.99 72 False 0
1 B08FJCGXKR 4 Pieces Words Christmas Clear Stamps Set Sili... https://m.media-amazon.com/images/I/81UlHQvTTJ... https://www.amazon.com/dp/B08FJCGXKR 4.5 0 10.99 0.00 5 False 50
2 B07NPXM3G4 5 Pair of Half Insoles - Shoe Filler, Half-Siz... https://m.media-amazon.com/images/I/81Q2IsZYdw... https://www.amazon.com/dp/B07NPXM3G4 3.9 0 9.95 0.00 131 False 400
3 B0C7CGPGZC Disney Cars Mini Racers 15-Pack Variety Radiat... https://m.media-amazon.com/images/I/81ND3IJWyC... https://www.amazon.com/dp/B0C7CGPGZC 5.0 0 64.95 0.00 236 False 0
4 B07SH21322 Crinkle Cut Paper Shred Filler Royal Blue & Go... https://m.media-amazon.com/images/I/81ORI0gpx3... https://www.amazon.com/dp/B07SH21322 4.1 0 10.95 0.00 12 False 50

Data Cleaning

The datasets were cleaned seperately, then concatenated, then some final steps were taken to clean it.

The steps to clean the web-scaped data were:

  • Add date_scraped column
  • Remove unecessary columns: type, position, has_prime, is_amazon_choice, is_limited_deal, availability_quantity, spec, price_string, price_symbol, section_name
  • Expand and fix original_price
  • Rename columns to match standard snake case for merging both datasets
  • Drop rows with no asin or name or price
  • Drop rows with price of 0.0, since that doesn’t make sense
  • Fill NaN reviews column with 0

The steps to clean the Kaggle data were:

  • Add date_scraped column
  • Remove unecessary columns boughtInLastMonth
  • Drop rows with any NaNs
  • Fix list_price of $0 to be instead equal to price
  • Change category_id to actual category by using category table
  • Drop rows with price of $0, since that doesn’t make sense
  • Rename columns to match standard snake case for merging both datasets

And then, after they were concatenated, the steps to clean were:

  • Remove duplicates (by asin + date scraped)
  • Rename columns

The final cleaned (and concatenated) dataset can be seen in Table 2 (with the original raw data in Table 1):

Table 2: The final unioned, cleaned, and processed data.
Asin Name Image Url Is Best Seller Stars Reviews Url Price Date Scraped List Price Category
0 B0C6TLWQ2V Pllieay 5 in 1 Plastic Cross-Stitch Hoops, Pla... https://m.media-amazon.com/images/I/51T8EHasQR... False 4.3 0.0 https://www.amazon.com/dp/B0C6TLWQ2V 20.99 2023-11-01 21.99 Needlework Supplies
1 B08GL3CVLF Toddlers and Baby Boys' Swimsuit Trunk and Ras... https://m.media-amazon.com/images/I/81XD7QCQb7... True 4.8 0.0 https://www.amazon.com/dp/B08GL3CVLF 23.80 2023-11-01 23.80 Baby Boys' Clothing & Shoes
2 B0BMG7TTLV Hi Vis Rain Jacket, Class 3 High Visibility Ra... https://m.media-amazon.com/images/I/61NmeMU7vk... False 4.2 0.0 https://www.amazon.com/dp/B0BMG7TTLV 70.99 2023-11-01 70.99 Safety & Security
3 B0CFQW79K8 Tea Infuser Strainers for Loose Tea, Stainless... https://m.media-amazon.com/images/I/71vpVzzzbg... False 4.7 0.0 https://www.amazon.com/dp/B0CFQW79K8 7.99 2023-11-01 7.99 Kitchen & Dining
4 B01M25PXTZ Younik Vertical Stand for PS-4 Slim, Built-in ... https://m.media-amazon.com/images/I/71i8zRx4vv... False 4.6 0.0 https://www.amazon.com/dp/B01M25PXTZ 11.99 2023-11-01 11.99 Legacy Systems

The code for the data cleaning can be found here.

Data Preprocessing / Visualization

Various types of EDA were performed in order to examine the data; as a note, most visuals are interactive (zoomable, pannable, etc). The code for all visualizations can be found here.

Important

If the interactive figures don’t load, dont worry: just turn off all ad-blockers/privacy browsing, make sure you are using Chrome/Firefox, and refresh the page until all figures load.

Figure 1

Figure 1: A histogram of all categories of all Amazon products. Note scraped data did not have categories, but the Kaggle data did.

Figure 2

Figure 2: Stars vs number of reviews recieved by an amazon product, colored by whether the product was a best-seller.

Figure 3

Wordcloud for categories of products

Wordcloud for the names of products
Figure 3: Wordclouds (where more frequent appearing words are bigger) of the categories of products and the names of products.

Figure 4

Figure 4: Price vs list price of items with the same ASIN across dates scraped, with a trendline.

Given that we can see outliers in price affecting the plot of the graph, it was decided for analysis to only consider those prices most populous, aka prices less than $800.

Figure 5

Figure 5: Histogram of prices, colored by whether the change in price increased or decreased over time, for those items that were in both sets of data.

Figure 6

Figure 6: Price vs the difference in price, over the two sets of data, colored by whether the price diff increased or decreased.
Introduction
ARM (Association Rule Mining)